Chapter 12: Introduction to nonlinear learning

12.8 Efficient

In this Section we introduce a second general paradigm for effective model search, that is, the effective search for a model with the proper capacity. With the first approach, discussed in the previous Section - boosting - we took a 'bottom-up' approach to fine tuning the amount of capacity a model needs: we began with a low capacity (and likely underfitting) model and then gradually increased its capacity by adding units (from the same family of universal approximators) until we had built up 'just enough' capacity (that is, the amount that minimizes validation error).

In this Section we introduce the complementary approach - called regularization. Instead of building up capacity 'starting at the bottom', with regularization we take a 'top-down' view and start off with a high capacity model - that is one which would likely overfit, providing a low training error but high validation error - and gradually decrease its capacity until the capacity is 'just right' (that is, until validation error is minimized).

12.8.1 Optimization and the problem of overfitting

Imagine we have a simple nonlinear regression problem, like the one shown in the left panel of the Figure below, and we try to fit this data with a single model - made up of a sum of universal approximators of a given type - that has far too much capacity. In other words, we train our high capacity model on a training portion of this data by minimizing an appropriate cost function, e.g., the Least Squares cost. In the left panel we also show in red the corresponding fit provided by this model, which wildly overfits the data.

Figure 1: (left panel) A generic nonlinear regression dataset, along with a high capacity `model` overfitting its training portion (with the fit shown in red). (right panel) A *figurative illustration* of the cost function associated with this `model` (i.e., we show it as taking in only a single input, for visualization purposes only). Here the set of parameters associated with our overfitting `model` are those near the minimum of this cost, highlighted with a red dot.

In a high capacity model like this one we have clearly used too many and/or too flexible universal approximators (feature transformations). Equally important to diagnosing the problem of overfitting is how well we tune our model's parameters or - in other words - how well we minimize its corresponding cost function. In the present case, for example, the parameter setting of the model in the left panel that overfits our training data comes from near the minimum of the model's cost function. This cost function is drawn figuratively in the right panel, where the minimum is shown as a red point. This is true in general as well: regardless of how many feature transformations we use, a model will overfit a training set only when we tune its parameters well or, in other words, when we minimize its corresponding cost function well. Conversely, even if we use a high capacity model, if we do not tune its parameters well the model will not overfit its training data.

Figure 2: A version of the previous Figure, only now we show the result of two fits. Our (training-set) overfit is shown once again in red in the left panel, and the evaluation of these parameters via the associated cost function is shown figuratively in the right panel as a red dot. Here however we also show a second fit in blue provided by a set of weights that are not near the global minimum of the cost, with their evaluation via the cost shown as a blue dot in the right panel. Because these parameters do not minimize the cost function they do not overfit the training data, and provide a better representation of the overall dataset.
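The connection between optimization and overfitting described above can be sketched numerically. The snippet below is only an illustrative sketch under stated assumptions: the noisy sine dataset, the degree-15 polynomial features (standing in for a high capacity model), the split sizes, and the steplength are all hypothetical choices and are not taken from the Figures. It compares a complete minimization of the Least Squares cost with a partial one consisting of a few gradient descent steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical toy dataset: noisy samples of a sine wave on [-1, 1]
x = rng.uniform(-1.0, 1.0, 30)
y = np.sin(3.0 * x) + 0.3 * rng.standard_normal(30)

# simple training / validation split (the points are already in random order)
x_train, y_train = x[:20], y[:20]
x_val, y_val = x[20:], y[20:]

def features(x, degree=15):
    # high capacity model: polynomial features 1, x, ..., x^degree
    return np.vander(x, degree + 1, increasing=True)

def least_squares(w, x, y):
    # Least Squares cost of the weights w on the points (x, y)
    return np.mean((features(x) @ w - y) ** 2)

F = features(x_train)

# (a) complete minimization of the training cost
w_full, *_ = np.linalg.lstsq(F, y_train, rcond=None)

# (b) partial minimization: only a few small gradient descent steps
w_partial = np.zeros(F.shape[1])
alpha = 0.03  # assumed steplength, small enough for steady descent here
for _ in range(50):
    grad = 2.0 * F.T @ (F @ w_partial - y_train) / len(y_train)
    w_partial = w_partial - alpha * grad

train_full = least_squares(w_full, x_train, y_train)
train_partial = least_squares(w_partial, x_train, y_train)
val_full = least_squares(w_full, x_val, y_val)
val_partial = least_squares(w_partial, x_val, y_val)

print("complete minimization: train", train_full, "validation", val_full)
print("partial minimization:  train", train_partial, "validation", val_partial)
```

The completely minimized weights always attain the lower training error. How the two validation errors compare depends on the particular data and split, so it is worth inspecting the printed values rather than assuming an outcome.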

12.8.2 Regularization: the 'top-down' approach to capacity tuning

Regularization techniques for capacity tuning leverage precisely this overfitting-optimization connection, and in general work by preventing the complete minimization of a cost function associated with a high capacity model. In other words, with regularization techniques we use a high capacity model and tune its parameters to indirectly encourage a good validation fit by preventing complete minimization of its associated cost function (and thus preventing overfitting to the training data). Contrary to boosting techniques, where we started 'from the bottom' and built up a more flexible model piece-by-piece by adding single feature transformations to it, regularization starts 'from the top' - the 'top' being a high capacity model - and tempers its capacity by imperfectly minimizing its corresponding cost function. This is a somewhat indirect way of arriving at a model that fits well overall, since what we do directly is prevent overfitting to the training data.

Regularization can be performed in a variety of ways, but in general there are two basic categories of strategies, which we discuss below: early stopping and the addition of a simple capacity-blunting function to the cost.

12.8.3 Early stopping

Here the idea is to literally stop the optimization procedure early, before a minimum is reached and overfitting occurs. This is done by measuring validation error during optimization, and (roughly speaking) halting the procedure when validation error is minimal.

As with any form of capacity-tuning, our ideal is to find a model that provides the lowest possible error on the validation set. With early-stopping we do this by stopping the minimization of our cost function (which is measuring training error) when validation error reaches its lowest point. The basic idea is illustrated in the figure below. In the left panel we show a prototypical nonlinear regression dataset, and in the middle the cost function of a high capacity model shown figuratively in two dimensions. As we begin a run of a local optimization method we measure both the training error (provided by the cost function we are minimizing) as well as validation error at each step of the procedure - as shown in the right panel. We try to halt the procedure when the validation error has reached its lowest point.

Figure 3: (left panel) A prototypical nonlinear regression dataset, (middle panel) a figurative illustration of the cost associated with a high capacity model, and (right panel) the measurement of training / validation error at each step of a local optimization procedure. With *early stopping* we make a run of a local optimization procedure and measure both the training and validation error at each step. We try to halt the procedure when the validation error reaches its lowest value, with the corresponding set of weights providing our high capacity `model` with the least chance of overfitting our training data (and hopefully providing a good fit to the entire dataset).

There are a number of important engineering details associated with making an effective early-stopping procedure. These include:

  • When is validation error really at its lowest? While generally speaking validation error decreases at the start of an optimization run and eventually increases (making somewhat of a 'U' shape), it can certainly fluctuate up and down during optimization, so it is not altogether obvious when the validation error has indeed reached its lowest point. To deal with this peculiarity, in practice a reasonable engineering choice is often made to stop based on how many steps have passed since the validation error last decreased.
  • Large (local optimization) steps are to be avoided. The idea with early stopping is to measure training / validation error often as an optimization procedure makes progress, so that the procedure can be halted when validation error is low. If one uses a local optimization procedure that takes very large steps - e.g., Newton's method or a stochastic / mini-batch first order approach - optimization can quickly lead to weights that overfit the training data (in other words, a set of weights that provides minimal validation error can be skipped over entirely). Thus when employing early stopping one needs to use a local optimization method with moderate-length steps.
  • Validation error should be measured often. Validation error should be measured frequently during the minimization process in order to determine a validation error minimizing set of weights. When employing a mini-batch / stochastic first order method, validation error should be measured several times per epoch, so that many steps are not taken without measuring validation error (perhaps skipping over error minimizing weights entirely).
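The engineering details above can be sketched in code. The function below is a minimal sketch, not a definitive implementation: the name `early_stopping_gd`, the `patience` rule (halt once validation error has not improved for a fixed number of steps), and the toy training / validation costs are all assumptions made for illustration. Validation error is measured after every gradient descent step, and the weights with the lowest validation error seen so far are kept.

```python
import numpy as np

def early_stopping_gd(grad, val_error, w0, alpha=0.01, max_steps=1000, patience=20):
    """Gradient descent returning the weights with the lowest validation error.

    grad, val_error: hypothetical callables giving the gradient of the training
    cost and the validation error at weights w.
    patience: number of consecutive steps without improvement before halting.
    """
    w = np.array(w0, dtype=float)
    best_w, best_err, since_best = w.copy(), val_error(w), 0
    history = [best_err]
    for _ in range(max_steps):
        w = w - alpha * grad(w)      # one moderate-length descent step
        err = val_error(w)           # measure validation error at every step
        history.append(err)
        if err < best_err:
            best_w, best_err, since_best = w.copy(), err, 0
        else:
            since_best += 1
            if since_best >= patience:
                break                # no improvement for `patience` steps
    return best_w, best_err, history

# toy illustration (assumed): the training cost (w - 2)^2 is minimized at w = 2,
# while the validation error (w - 1)^2 is lowest at w = 1
train_grad = lambda w: 2.0 * (w - 2.0)
val_err = lambda w: float(np.sum((w - 1.0) ** 2))

best_w, best_err, history = early_stopping_gd(
    train_grad, val_err, w0=[0.0], alpha=0.05, patience=5
)
print("weights kept:", best_w, "validation error:", best_err)
print("steps taken before halting:", len(history) - 1)
```

In this toy run the procedure halts well before the training cost is minimized at w = 2, and returns weights near w = 1 where validation error is lowest. A larger `patience` tolerates more fluctuation in validation error at the price of extra optimization steps.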

Below we show a few examples employing the early stopping regularization strategy.

Example 1. Early stopping of a prototypical regression dataset

Below we plot a prototypical nonlinear regression dataset. We will use early stopping regularization to fine tune the capacity of a model consisting of $10$ tanh neural network universal approximators.

Below we illustrate a large number of gradient descent steps to tune our high capacity model for this dataset. As you move the slider left to right you can see the resulting fit at each highlighted step of the run in the original dataset (top left), training (bottom left), and validation data (bottom right). Moving the slider to where the validation error is lowest provides - for this training / validation split of the original data - a fine nonlinear model for the entire dataset.

Example 2. Early stopping of a prototypical classification dataset

Below we plot a prototypical nonlinear classification dataset. We will use early stopping regularization to fine tune the capacity of a model consisting of $5$ tanh neural network universal approximators.

Below we illustrate a large number of gradient descent steps to tune our high capacity model for this dataset. As you move the slider left to right you can see the resulting fit at each highlighted step of the run in the original dataset (top left), training (bottom left), and validation data (bottom right). Moving the slider to where the validation error is lowest provides - for this training / validation split of the original data - a fine nonlinear model for the entire dataset.

12.8.4 Adding a simple, capacity-blunting function to the cost

By adding a simple function to the cost function we change its shape, and in particular change the location of its global minima. Since the global minima of the adjusted cost function do not align with those of the original cost, the adjusted cost can then be completely minimized with less fear of overfitting to the training data. This method of regularization is illustrated via the animation below. In the left panel we have a prototypical single input cost function $g(w)$, in the middle a simple function - here the quadratic $w^2$ - which we will add to the cost in order to 'blunt' it, and in the right panel their linear combination $g(w) + \lambda w^2$. As we increase $\lambda > 0$ notice how the cost's global minimum moves - in this case to the left. Thus a complete minimization of the combined function will not reach the global minimum of the original cost, and overfitting is prevented.
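The movement of the global minimum can also be checked numerically. The sketch below assumes a hypothetical single input cost $g(w) = (w - 2)^2$, chosen only because its regularized minimizer has the closed form $w^{\star} = 2 / (1 + \lambda)$; it is not the cost shown in the animation. For several values of $\lambda$ the minimizer of $g(w) + \lambda w^2$ is located on a fine grid and compared with this closed form.

```python
import numpy as np

def g(w):
    # hypothetical single input cost with its global minimum at w = 2
    return (w - 2.0) ** 2

w_grid = np.linspace(-1.0, 3.0, 4001)   # grid with spacing 0.001
lams = [0.0, 0.5, 1.0, 3.0]             # assumed values of lambda

minimizers = []
for lam in lams:
    regularized = g(w_grid) + lam * w_grid ** 2   # g(w) + lambda * w^2
    w_star = w_grid[np.argmin(regularized)]       # grid point of lowest value
    minimizers.append(w_star)
    closed_form = 2.0 / (1.0 + lam)
    print(f"lambda = {lam:.1f}: grid minimizer {w_star:.3f}, closed form {closed_form:.3f}")
```

As $\lambda$ grows the minimizer moves steadily left, away from the minimum of $g$ at $w = 2$ and toward the minimum of $w^2$ at the origin, so completely minimizing the combined function never recovers the original minimizer.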

TO BE CONTINUED....

Figure 4:

Example 3. Application to a prototypical regression dataset

Example 4. Application to a prototypical classification dataset